ABSTRACT

It is common to estimate a distribution by means of a step function. 
Such estimates can be made continuous by connecting the left end points 
of the steps with straight line segments. In this paper, the minimum 
risk estimator of this class is found for exponentially distributed 
data. Its risk is then compared with the risks of the sample 
distribution function and the Pyke estimator. 

I. INTRODUCTION 



The motivation behind this thesis is to provide a contribution 
towards the objective of finding the best way to estimate an unknown 
continuous distribution. The cue is taken from References 4 and 5 
which have adopted the following principles: 

(1) The estimator itself should be a continuous function. 

(2) The estimator should be simple and natural. 

The first principle has not been generally adopted. The modern 
textbook solution [Ref. 3] is to use the sample distribution (step) 
function. A simple and natural extension of the step function, 
satisfying both of the above principles, is to connect the left ends 
of the steps with straight line segments, following the example of the 
so-called ogive curve. Thus, the class of estimators can be described 
by: 

(a) plot the points $(X_r, C_r)$, r = 1, ..., n 

(b) connect these points with straight line segments 

where $0 < C_1 \le C_2 \le \dots \le C_n < 1$ are constants determined by 
some rule and $X_1 < X_2 < \dots < X_n$ are the order statistics of a 
random sample of size n drawn from some unknown CDF, F. The only issue 
to be resolved is how to determine the sequence $\{C_i\}_{i=1}^{n}$. 

Reference 6 suggests using $C_r = r/(n+1)$, referring to the result as the 
Pyke estimator, and shows that the expected squared error of a continuous 
Pyke estimator for some distribution, F, is no larger than the expected 
squared error of the sample distribution function for a sufficiently 
large sample size. It is also shown that the Pyke estimator strictly 
dominates the sample distribution function for sample sizes greater 
than or equal to one. The risk function used is the integrated 
expected squared error, a popular choice [Ref. 1]. 

In Ref. 4, the sequence $\{C_i\}_{i=1}^{n}$ for which the risk is minimized 
is found for the case where F is the uniform distribution. Comparisons 
of this optimal risk with that of the sample distribution function 
and the Pyke estimator are made. The result is that the risk of the 
Pyke estimator is significantly closer to the optimal than the risk of 
the sample distribution function. 

The purpose of this thesis is to determine whether the risk of the Pyke 
estimator remains closer to the optimal than that of the sample 
distribution function for a random sample drawn from the exponential 
distribution. The approach will be similar to that used in Ref. 4. To 
provide continuity for the reader, the notation of Ref. 4 is used 
wherever possible. 






II. DETERMINATION OF THE MINIMUM RISK COEFFICIENTS 



Let $C_0, C_1, \dots, C_n, C_{n+1}$ be an increasing sequence with $C_0 = 0$ 
and $C_{n+1} = 1$. The continuous function estimator for F(x) can be 
defined by: 

$$H(x) = C_r + \frac{x - X_r}{X_{r+1} - X_r}\,(C_{r+1} - C_r), \qquad X_r \le x < X_{r+1}, \quad r = 0, 1, \dots, n \tag{2.1}$$

where $X_1 < \dots < X_n$ are the order statistics from an absolutely 
continuous distribution F(x). Assume that the population is lower 
bounded and contained in the interval $[0, \infty)$; then define $X_0 = 0$, 
$X_{n+1} = \infty$ and the risk function by: 

$$R = E \int_0^{\infty} \left[ F(x) - H(x) \right]^2 dF(x) \tag{2.2}$$
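As an illustration of definition (2.1), a minimal Python sketch of the estimator follows. The function names `pyke_coefficients` and `make_estimator` are hypothetical and not part of the original work; the coefficients $C_r = r/(n+1)$ are the Pyke choice attributed to Ref. 6, and any other increasing sequence could be passed instead. Because $X_{n+1} = \infty$, the last segment has zero slope, so the sketch returns $C_n$ beyond the largest observation.

```python
import bisect


def pyke_coefficients(n):
    """Pyke coefficients C_r = r/(n+1), r = 1..n (the choice attributed to Ref. 6)."""
    return [r / (n + 1) for r in range(1, n + 1)]


def make_estimator(sample, coeffs):
    """Return H(x) of (2.1) for a nonnegative sample and coefficients C_1..C_n.

    Hypothetical helper: X_0 = 0 and C_0 = 0 are prepended, and X_{n+1} is
    treated as infinity, so H(x) = C_n for x at or beyond the largest value.
    """
    n = len(sample)
    xs = [0.0] + sorted(sample)   # X_0, X_1, ..., X_n
    cs = [0.0] + list(coeffs)     # C_0, C_1, ..., C_n

    def H(x):
        if x < 0:
            return 0.0
        r = bisect.bisect_right(xs, x) - 1   # index with X_r <= x < X_{r+1}
        if r >= n:                           # last interval: zero slope
            return cs[n]
        frac = (x - xs[r]) / (xs[r + 1] - xs[r])
        return cs[r] + frac * (cs[r + 1] - cs[r])

    return H


H = make_estimator([3.0, 1.0], pyke_coefficients(2))
print(H(0.5), H(1.0), H(2.0), H(5.0))
```

With the sample {1, 3} and n = 2, the estimator passes through the points (1, 1/3) and (3, 2/3) and is linear in between.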



H(x) is defined piecewise according to which of the random intervals 
$(X_r, X_{r+1})$ contains x. Assume that the sample is extracted from an 
exponential distribution. For convenience, let the distribution's 
mean equal one.* Let u, v with u < v be the variables in the sample 
space of $(X_r, X_{r+1})$. Their joint density function is: 

$$f_{r,r+1}(u,v) = \frac{n!}{(r-1)!\,(n-r-1)!}\,\left(1 - e^{-u}\right)^{r-1} e^{-u}\, e^{-v(n-r)}$$

for $0 < u < v < \infty$ and $1 \le r \le n-1$. 



* The value of the mean does not matter since the risk does not 
change with linear transformations of the basic random variable. 
(Personal communication from Professor R. R. Read) 

The end point densities can be derived as: 

$$f_{0,1}(0,v) = n e^{-vn}, \qquad 0 < v < \infty \tag{2.4}$$

$$f_{n,n+1}(u,\infty) = n e^{-u}\left(1 - e^{-u}\right)^{n-1}, \qquad 0 < u < \infty \tag{2.5}$$

Define $A_r = C_{r+1} - C_r$, then: 

$$C_r = \sum_{j=0}^{r-1} A_j$$

Rewriting (2.2): 

$$R = \sum_{r=0}^{n} \iiint_{0<u<x<v<\infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\, A_r \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \tag{2.21}$$



The minimum risk coefficients, $C_r$, can now be found using classical 
optimization techniques. Using the Lagrangian form, the problem 
becomes: 

Minimize: $\Phi = R - \lambda \left( \sum_{j=0}^{n} A_j - 1 \right)$ 

Subject to: $\sum_{j=0}^{n} A_j = 1$ 

where $\lambda$ is the Lagrange multiplier. Thus, the approach becomes: 



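Set $\partial \Phi / \partial A_k = 0$ for k = 0, 1, ..., n and solve for $C_k$. 

$$
\begin{aligned}
\frac{\partial \Phi}{\partial A_k} ={}& -2 \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_k - \frac{x-u}{v-u}\, A_k \right] \frac{x-u}{v-u}\, e^{-x} f_{k,k+1}(u,v)\, du\, dv\, dx \\
& -2 \sum_{r=k+1}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\, A_r \right] e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \\
& - \lambda = 0
\end{aligned}
\tag{2.6}
$$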



which after laborious but straightforward integration leads to: 

$$
\begin{aligned}
& C_k \left[ 2(n-k)(n-k+1) - (2n-2k+1)(n-k+1)(n-k) \ln\frac{n-k+1}{n-k} \right] \\
& - C_{k+1} \left[ (n-k)(2n-2k+1) - 2(n-k)^2 (n-k+1) \ln\frac{n-k+1}{n-k} \right] \\
& + \sum_{r=k+1}^{n} C_r \left[ (n-r+1)(n-r) \ln\frac{n-r+1}{n-r} - (n-r+1) \right] \\
& - \sum_{r=k+1}^{n} C_{r+1} \left[ (n-r+1)(n-r) \ln\frac{n-r+1}{n-r} - (n-r) \right] \\
={}& \frac{(n-k+2)(n-k+1)(n-k)}{4(n+2)} \ln\frac{n-k+2}{n-k} - \frac{(n-k+1)(n-k)}{2(n+2)} \\
& - (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} + (n-k) - \sum_{r=k+1}^{n} \frac{r+1}{n+2} - \frac{\lambda (n+1)}{2}
\end{aligned}
\tag{2.7}
$$
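One of the integrals behind (2.7) can be worked by hand, which shows where the logarithmic terms come from. The sketch below assumes the standard fact that the spacings of exponential order statistics are independent exponentials: given that x falls between $X_r$ and $X_{r+1}$ (with $0 \le r \le n-1$), the points u, x, v behave as consecutive order statistics of n+1 unit exponentials. The symbols $w$, $a$, $b$, $E_1$, $E_2$ and $B$ are introduced here for illustration only and do not appear in the original.

```latex
% Conditional mean of the interpolation weight w = (x-u)/(v-u).
% a = n-r+1 and b = n-r are the rates of the two spacings, so a - b = 1.
\begin{align*}
x - u &= \frac{E_1}{a}, \qquad v - x = \frac{E_2}{b}, \qquad
B = \frac{E_1}{E_1 + E_2} \sim U(0,1), \\
w &= \frac{E_1/a}{E_1/a + E_2/b} = \frac{bB}{bB + a(1-B)} = \frac{bB}{a - B}, \\
E[w] &= \int_0^1 \frac{bB}{a - B}\, dB
      = b \left[ a \ln\frac{a}{a-1} - 1 \right]
      = (n-r+1)(n-r) \ln\frac{n-r+1}{n-r} - (n-r).
\end{align*}
```

The last expression is the bracket that multiplies $C_{r+1}$ in the second sum of (2.7); the remaining brackets follow from $E[w^2]$ and $E[e^{-x} w]$ in the same way.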



For k = n, (2.7) implies $\lambda = 0$ (the indeterminate form $-(n-k)\ln(n-k)$ 
is zero by use of (2.4) and (2.5)). 

Using (2.5) in (2.6) for k = n provides: 

$$\frac{\partial \Phi}{\partial A_n} = \iint_{0<u<x<\infty} \left( A_n - e^{-x} \right) e^{-x}\, n e^{-u} \left(1 - e^{-u}\right)^{n-1} du\, dx = 0 \tag{2.8}$$

which implies: 

$$C_n = \frac{n+1}{n+2}.$$
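The step from (2.8) to the value of $C_n$ can be checked as follows. The check uses $A_n = C_{n+1} - C_n = 1 - C_n$ from the definitions above and the substitution $t = e^{-u}$, which is chosen here for convenience and is not taken from the original.

```latex
\begin{align*}
\int_u^{\infty} \left( A_n - e^{-x} \right) e^{-x}\, dx
  &= A_n e^{-u} - \tfrac{1}{2} e^{-2u}, \\
\int_0^{\infty} \left( A_n e^{-u} - \tfrac{1}{2} e^{-2u} \right)
  n e^{-u} \left(1 - e^{-u}\right)^{n-1} du
  &= \int_0^1 \left( A_n t - \tfrac{1}{2} t^2 \right) n (1-t)^{n-1}\, dt \\
  &= \frac{A_n}{n+1} - \frac{1}{(n+1)(n+2)} .
\end{align*}
% Setting this to zero gives A_n = 1/(n+2), hence C_n = 1 - A_n = (n+1)/(n+2).
```

For n = 1 this gives $C_1 = 2/3$, which agrees with the first minimum risk entry of Table I.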

To simplify notation, let: 

$$F_1(k) = (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} - 2(n-k+2)(n-k+1) + 2(n-k+2)(n-k+1)^2 \ln\frac{n-k+2}{n-k+1}$$

$$F_2(k) = 2(n-k)(n-k+1) - (2n-2k+1)(n-k+1)(n-k) \ln\frac{n-k+1}{n-k}$$

$$F_3(k) = (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} - (n-k+2)(n-k+1) \ln\frac{n-k+2}{n-k+1}$$

$$
\begin{aligned}
H(k) ={}& \frac{(n-k+2)(n-k+1)(n-k)}{4(n+2)} \ln\frac{n-k+2}{n-k} - \frac{(n-k+1)(n-k)}{2(n+2)} \\
& - (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} + (n-k) - \sum_{r=k+1}^{n} \frac{r+1}{n+2} - \frac{\lambda (n+1)}{2}
\end{aligned}
$$
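The functions $F_1$, $F_2$, $F_3$ and $H$ are easy to mistype, so a small numerical sketch may help when checking them. The Python below is an illustration only, not the author's program: the helper `_xlog` is hypothetical and encodes the convention that the indeterminate form $0 \cdot \ln(\cdot/0)$ is taken as zero, and `H` is evaluated with $\lambda = 0$.

```python
import math


def _xlog(coef, num, den):
    """coef * ln(num/den), with the indeterminate form 0 * ln(./0) taken as 0."""
    return 0.0 if coef == 0 else coef * math.log(num / den)


def F1(n, k):
    m = n - k
    return (_xlog((m + 1) * m, m + 1, m)
            - 2 * (m + 2) * (m + 1)
            + _xlog(2 * (m + 2) * (m + 1) ** 2, m + 2, m + 1))


def F2(n, k):
    m = n - k
    return 2 * m * (m + 1) - _xlog((2 * m + 1) * (m + 1) * m, m + 1, m)


def F3(n, k):
    m = n - k
    return _xlog((m + 1) * m, m + 1, m) - _xlog((m + 2) * (m + 1), m + 2, m + 1)


def H(n, k):
    """Right-hand side of (2.7), evaluated with lambda = 0."""
    m = n - k
    tail = sum((r + 1) / (n + 2) for r in range(k + 1, n + 1))
    return (_xlog((m + 2) * (m + 1) * m / (4 * (n + 2)), m + 2, m)
            - (m + 1) * m / (2 * (n + 2))
            - _xlog((m + 1) * m, m + 1, m)
            + m - tail)


# Row k = 1 of (2.7) for n = 2, with C_2 fixed at (n+1)/(n+2) = 3/4.
n = 2
C2 = (n + 1) / (n + 2)
C1 = (H(n, 1) - F1(n, 2) * C2) / F2(n, 1)
print(round(C1, 7))
```

For n = 2 this row gives $C_1 \approx 0.3383387$, the value listed in Table I. The same functions also satisfy $F_3(k) - F_1(k) = F_2(k-1)$, which can be checked numerically.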
Equation (2.7) may be written in matrix form: 

$$A\,C = H \tag{2.9}$$

where: 

$$
A = \begin{bmatrix}
F_1(1) & F_3(2) & F_3(3) & \cdots & F_3(n-1) & F_3(n) \\
F_2(1) & F_1(2) & F_3(3) & \cdots & F_3(n-1) & F_3(n) \\
0 & F_2(2) & F_1(3) & \cdots & F_3(n-1) & F_3(n) \\
0 & 0 & F_2(3) & \cdots & F_3(n-1) & F_3(n) \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & F_3(n-1) & F_3(n) \\
0 & 0 & 0 & \cdots & F_1(n-1) & F_3(n) \\
0 & 0 & 0 & \cdots & F_2(n-1) & F_1(n)
\end{bmatrix},
\qquad
C = \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_{n+1} \end{bmatrix}
\qquad \text{and} \qquad
H = \begin{bmatrix} H(0) \\ H(1) \\ \vdots \\ H(n) \end{bmatrix}
$$



Successive row subtraction with the last row and the last column 
discarded (since $C_n$ has already been determined) results in A 
becoming a tri-diagonal matrix of the form: 

$$
A' = \begin{bmatrix}
G_1(1) & G_2(1) & 0 & 0 & \cdots & 0 & 0 & 0 \\
G_2(1) & G_1(2) & G_2(2) & 0 & \cdots & 0 & 0 & 0 \\
0 & G_2(2) & G_1(3) & G_2(3) & \cdots & 0 & 0 & 0 \\
0 & 0 & G_2(3) & G_1(4) & \cdots & 0 & 0 & 0 \\
0 & 0 & 0 & G_2(4) & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & G_1(n-3) & G_2(n-3) & 0 \\
0 & 0 & 0 & 0 & \cdots & G_2(n-3) & G_1(n-2) & G_2(n-2) \\
0 & 0 & 0 & 0 & \cdots & 0 & G_2(n-2) & G_1(n-1)
\end{bmatrix}
$$

where: 

$$G_1(k) = F_1(k) - F_2(k)$$

$$G_2(k) = F_3(k) - F_1(k) = F_2(k-1)$$

and 

$$
H' = \begin{bmatrix} H'(1) \\ H'(2) \\ \vdots \\ H'(n-1) \end{bmatrix}
\qquad \text{where} \qquad H'(k) = H(k-1) - H(k)
$$

Equation (2.9) becomes: 

$$A'\,C = H' \tag{2.91}$$

Equation (2.91) represents a 2nd order linear difference equation 
with variable coefficients. An explicit solution appears to be out 
of reach, forcing the use of numerical methods. 

This matrix equation can be solved by triangular decomposition for 
any given sample size. The algorithm for this deterministic 
solution and the associated Fortran subroutine, TRID, are described 
in Ref. 2. Coefficients for various sample sizes were computed using 
TRID and are shown with the corresponding Pyke coefficients in Table I. 
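For readers without access to TRID, the idea of solving a tri-diagonal system by elimination and back substitution can be sketched in a few lines of Python. This is not the routine of Ref. 2: it is a generic textbook elimination (often called the Thomas algorithm) without pivoting, the name `solve_tridiagonal` is hypothetical, and the sketch assumes that no zero pivot is encountered.

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Solve a tri-diagonal system by forward elimination and back substitution.

    sub  : sub-diagonal entries   (length m-1)
    diag : main diagonal entries  (length m)
    sup  : super-diagonal entries (length m-1)
    rhs  : right-hand side        (length m)
    Assumes every pivot is nonzero; no pivoting is performed.
    """
    m = len(diag)
    d = list(diag)   # work on copies so the inputs are left untouched
    b = list(rhs)
    for i in range(1, m):
        w = sub[i - 1] / d[i - 1]
        d[i] -= w * sup[i - 1]
        b[i] -= w * b[i - 1]
    x = [0.0] * m
    x[-1] = b[-1] / d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = (b[i] - sup[i] * x[i + 1]) / d[i]
    return x


# Small check on a 3 x 3 system whose solution is (1, 2, 3).
x = solve_tridiagonal([1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0], [4.0, 8.0, 8.0])
print(x)
```

To apply the sketch to (2.91), the diagonal would be filled with $G_1(k)$ and the off-diagonals with $G_2(k)$; that wiring is left out here because it depends on the indexing conventions of Ref. 2.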
TABLE I. COEFFICIENTS OF THE CONTINUOUS ESTIMATORS 

   n     c    MINIMUM RISK    PYKE 

   1     1    0.6666667       0.5000000 

   2     1    0.3383387       0.3333333 
         2    0.7500000       0.6666667 

   3     1    0.3432969       0.2500000 
         2    0.4987318       0.5000000 
         3    0.8000000       0.7500000 

   4     1    0.2800868       0.2000000 
         2    0.4240374       0.4000000 
         3    0.5887811       0.6000000 
         4    0.8333333       0.8000000 

   5     1    0.2394752       0.1666667 
         2    0.3578057       0.3333333 
         3    0.5126778       0.5000000 
         4    0.6460357       0.6666667 
         5    0.8571429       0.8333333 

   6     1    0.2085332       0.1428571 
         2    0.3125745       0.2857143 
         3    0.4437924       0.4285714 
         4    0.5721471       0.5714286 
         5    0.6906526       0.7142857 
         6    0.8750000       0.8571429 

   7     1    0.1848704       0.1250000 
         2    0.2769643       0.2500000 
         3    0.3940854       0.3750000 
         4    0.5042591       0.5000000 
         5    0.6200241       0.6250000 
         6    0.7249480       0.7500000 
         7    0.8888889       0.8750000 

   8     1    0.1660087       0.1111111 
         2    0.2488631       0.2222222 
         3    0.3539021       0.3333333 
         4    0.4534583       0.4444444 
         5    0.5541501       0.5555556 
         6    0.6579414       0.6666667 
         7    0.7524714       0.7777778 
         8    0.9000000       0.8888889 

   9     1    0.1506598       0.1000000 
         2    0.2259314       0.2000000 
         3    0.3213671       0.3000000 
         4    0.4115223       0.4000000 
         5    0.5034373       0.5000000 
         6    0.5946056       0.6000000 
         7    0.6890570       0.7000000 
         8    0.7749696       0.8000000 
         9    0.9090909       0.9000000 

  10     1    0.1379129       0.0909091 
         2    0.2068969       0.1818182 
         3    0.2943053       0.2727273 
         4    0.3768931       0.3636364 
         5    0.4608337       0.4545455 
         6    0.5447461       0.5454545 
         7    0.6284070       0.6363636 
         8    0.7149642       0.7272727 
         9    0.7937232       0.8181818 
        10    0.9166667       0.9090909 

  20     1    0.0747509       0.0476190 
         2    0.1123907       0.0952381 
         3    0.1599652       0.1428571 
         4    0.2048848       0.1904762 
         5    0.2505187       0.2380952 
         6    0.2959665       0.2857143 
         7    0.3414700       0.3333333 
         8    0.3869660       0.3809524 
         9    0.4324732       0.4285714 
        10    0.4779891       0.4761905 
        11    0.5235179       0.5238095 
        12    0.5690642       0.5714286 
        13    0.6146325       0.6190476 
        14    0.6602411       0.6666667 
        15    0.7058782       0.7142857 
        16    0.7516879       0.7619048 
        17    0.7973108       0.8095238 
        18    0.8445265       0.8571429 
        19    0.8874853       0.9047619 
        20    0.9545455       0.9523810 

  50     1    0.0315038       0.0196078 
         2    0.0474483       0.0392157 
         3    0.0675612       0.0588235 
         4    0.0865575       0.0784314 
         5    0.1058530       0.0980392 
         6    0.1250684       0.1176471 
         7    0.1443054       0.1372549 
         8    0.1635367       0.1568627 
         9    0.1827696       0.1764706 
        10    0.2020022       0.1960784 
        11    0.2212349       0.2156863 
        12    0.2404678       0.2352941 
        13    0.2597008       0.2549020 
        14    0.2789339       0.2745098 
        15    0.2981671       0.2941176 
        16    0.3174005       0.3137255 
        17    0.3366341       0.3333333 
        18    0.3558678       0.3529412 
        19    0.3751017       0.3725490 
        20    0.3943358       0.3921569 
        21    0.4135701       0.4117647 
        22    0.4328046       0.4313725 
        23    0.4520395       0.4509804 
        24    0.4712746       0.4705882 
        25    0.4905101       0.4901961 
        26    0.5097460       0.5098039 
        27    0.5289823       0.5294118 
        28    0.5482191       0.5490196 
        29    0.5674565       0.5686275 
        30    0.5866945       0.5882353 
        31    0.6059332       0.6078431 
        32    0.6251728       0.6274510 
        33    0.6444134       0.6470588 
        34    0.6636552       0.6666667 
        35    0.6828984       0.6862745 
        36    0.7021433       0.7058824 
        37    0.7213902       0.7254902 
        38    0.7406397       0.7450980 
        39    0.7598924       0.7647059 
        40    0.7791493       0.7843137 
        41    0.7984114       0.8039216 
        42    0.8176810       0.8235294 
        43    0.8369599       0.8431373 
        44    0.8562559       0.8627451 
        45    0.8755638       0.8823529 
        46    0.8949449       0.9019608 
        47    0.9142469       0.9215686 
        48    0.9342227       0.9411765 
        49    0.9523976       0.9607843 
        50    0.9807692       0.9803922 
III. CALCULATION OF THE RISKS 

From equation (2.2), the risk function for the minimum risk 
coefficients, R(min), is: 

$$
\begin{aligned}
R(\min) ={}& \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\, A_r \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \\
={}& \sum_{r=0}^{n} \Bigg\{ \frac{(1 - C_r)^2}{n+1} + \frac{2 (C_r - 1)(n-r+1)}{(n+2)(n+1)} \\
& + A_r\, \frac{(n-r+2)(n-r+1)(n-r)}{2(n+2)(n+1)} \left[ \ln\frac{n-r+2}{n-r} - \frac{2}{n-r+2} \right] \\
& + 2 A_r (C_r - 1)\, \frac{(n-r+1)(n-r)}{n+1} \left[ \ln\frac{n-r+1}{n-r} - \frac{1}{n-r+1} \right] \\
& + A_r^2\, \frac{(n-r+1)(n-r)}{n+1} \left[ \frac{2n-2r+1}{n-r+1} - 2(n-r) \ln\frac{n-r+1}{n-r} \right] \\
& + \frac{(n-r+1)(n-r+2)}{(n+3)(n+2)(n+1)} \Bigg\}
\end{aligned}
\tag{3.1}
$$

The risk function for the Pyke estimator, R(Pyke), is: 

$$
\begin{aligned}
R(\text{Pyke}) ={}& \sum_{r=0}^{n} \iiint_{0<u<x<v<\infty} \left[ 1 - e^{-x} - \frac{r}{n+1} - \frac{x-u}{v-u} \left( \frac{1}{n+1} \right) \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \\
={}& \sum_{r=0}^{n} \Bigg\{ \frac{5(n-r)^2 + 5(n-r) + 1}{(n+1)^3} - \frac{(n-r+1)(3n-3r+2)}{(n+1)^2 (n+2)} + \frac{(n-r+1)(n-r+2)}{(n+3)(n+2)(n+1)} \\
& - \frac{2(n-r)(n-r+1)(2n-2r+1)}{(n+1)^3} \ln\frac{n-r+1}{n-r} \\
& + \frac{(n-r)(n-r+1)(n-r+2)}{2(n+1)^2 (n+2)} \ln\frac{n-r+2}{n-r} \Bigg\}
\end{aligned}
\tag{3.2}
$$

The risk function for the sample distribution function, R(SDF), 
is: 

$$R(\text{SDF}) = \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - \frac{r}{n} \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx = \frac{1}{6n} \tag{3.3}$$
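The closed form (3.1) lends itself to a direct numerical check. The Python sketch below evaluates the sum in (3.1) for any coefficient vector; with $C_r = r/(n+1)$ it gives the Pyke risk of (3.2). The function names `risk` and `sdf_risk` are hypothetical, and terms carrying the factor $(n-r)$ are skipped on the last interval, following the convention that the indeterminate form is zero.

```python
import math


def risk(C):
    """Risk of the piecewise-linear estimator per (3.1).

    C = [C_1, ..., C_n]; C_0 = 0 and C_{n+1} = 1 are appended internally.
    """
    n = len(C)
    c = [0.0] + list(C) + [1.0]
    total = 0.0
    for r in range(n + 1):
        m = n - r
        Cr = c[r]
        Ar = c[r + 1] - c[r]
        term = (1 - Cr) ** 2 / (n + 1)
        term += 2 * (Cr - 1) * (m + 1) / ((n + 2) * (n + 1))
        term += (m + 1) * (m + 2) / ((n + 3) * (n + 2) * (n + 1))
        if m > 0:  # terms with the factor (n - r) vanish on the last interval
            l1 = math.log((m + 1) / m)
            l2 = math.log((m + 2) / m)
            term += (Ar * (m + 2) * (m + 1) * m / (2 * (n + 2) * (n + 1))
                     * (l2 - 2 / (m + 2)))
            term += (2 * Ar * (Cr - 1) * (m + 1) * m / (n + 1)
                     * (l1 - 1 / (m + 1)))
            term += (Ar ** 2 * (m + 1) * m / (n + 1)
                     * ((2 * m + 1) / (m + 1) - 2 * m * l1))
        total += term
    return total


def sdf_risk(n):
    """Risk of the sample distribution function, 1/(6n), per (3.3)."""
    return 1 / (6 * n)


print(risk([1 / 2]), risk([2 / 3]), sdf_risk(1))
```

For n = 1 the sketch returns about 0.0682656 for the Pyke coefficient 1/2 and about 0.0480993 for the coefficient 2/3, which agree with the first row of Table II.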
IV. RESULTS AND CONCLUSIONS 



Values of each type of risk were computed for various sample sizes 
using (3.1), (3.2) and (3.3) and are shown in Table II and Figure 1, 
where it becomes apparent that the risk of the Pyke estimator is 
significantly closer to the optimum than is the risk of the sample 
distribution function. All the risks converge to zero at the rate 
1/n. Thus, n times the risk converges to a constant (1/6). It is 
interesting to compare the risks as they approach this asymptote. 
This is done in Table III and Figure 2. 

These results, coupled with those of Ref. 4, suggest that, given 
the criterion of minimizing expected squared error, the Pyke estimator 
should be used in lieu of the sample distribution function, 
particularly if the underlying population is suspected to be either 
exponential or uniform. 
TABLE II. VALUE OF RISK FUNCTIONS FOR SAMPLE SIZE N 

    N    MINIMUM RISK    PYKE RISK    SDF RISK 

    1    0.0480993       0.0682656    0.1666667 
    2    0.0363529       0.0394270    0.0833333 
    3    0.0276659       0.0298089    0.0555556 
    4    0.0233801       0.0246450    0.0416667 
    5    0.0203317       0.0212262    0.0333333 
    6    0.0180346       0.0187209    0.0277778 
    7    0.0162237       0.0167779    0.0238095 
    8    0.0147538       0.0152155    0.0208333 
    9    0.0135338       0.0139269    0.0185185 
   10    0.0125036       0.0128436    0.0166667 
   12    0.0108566       0.0111197    0.0138889 
   14    0.0095964       0.0098068    0.0119048 
   16    0.0086000       0.0087725    0.0104167 
   18    0.0077919       0.0079360    0.0092593 
   20    0.0071232       0.0072455    0.0083333 
   30    0.0049866       0.0050497    0.0055556 
   40    0.0038370       0.0038755    0.0041667 
   50    0.0031184       0.0031444    0.0033333 
   60    0.0026266       0.0026453    0.0027778 
   70    0.0022689       0.0022830    0.0023810 
   80    0.0019969       0.0020079    0.0020833 
   90    0.0017832       0.0017921    0.0018519 
  100    0.0016108       0.0016181    0.0016667 
Figure 1. [Plot of RISK against sample size N; curves: SDF, PYKE, MIN.] 
TABLE III. VALUE OF THE RISK FUNCTIONS × SAMPLE SIZE 

    N    N(MINIMUM RISK)    N(PYKE RISK)    N(SDF RISK) 

    1    0.0480993          0.0682656       0.1666667 
    2    0.0727058          0.0788540       0.1666667 
    3    0.0829976          0.0894267       0.1666667 
    4    0.0935202          0.0985800       0.1666667 
    5    0.1016586          0.1061310       0.1666667 
    6    0.1082079          0.1123254       0.1666667 
    7    0.1135658          0.1174451       0.1666667 
    8    0.1180301          0.1217241       0.1666667 
    9    0.1218044          0.1253423       0.1666667 
   10    0.1250362          0.1284358       0.1666667 
   12    0.1302796          0.1334367       0.1666667 
   14    0.1343494          0.1372958       0.1666667 
   16    0.1375995          0.1403596       0.1666667 
   18    0.1402547 
   20 
   30 
   40 
   50 
   60 
   70 
   80 
   90 
  100 